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Abstract 

With the remarkable development in inexpensive sequencing technologies and supporting computational tools, 
we have the promise of medicine being personalized by knowledge of the individual genome. Current 
technologies provide high throughput, but short reads. Reconstruction of the donor genome is based either on de 
novo assembly of the (short) reads, or on mapping donor reads to a standard reference. While such techniques 
demonstrate high success rates for inferring 'simple' genomic segments, they are confounded by segments with 
complex duplication patterns, including regions of direct medical relevance, like the HLA and the KIR regions. 
In this work, we address this problem with a method for assessing the quality of a predicted genome sequence for 
complex regions of the genome. This method combines two natural types of evidence: sequence similarity of the 
mapped reads to the predicted donor genome, and distribution of reads across the predicted genome. We define 
a new scoring function for read-to-genome matchings, which penalizes for sequence dissimilarities and deviations 
from expected read location distribution, and present an efficient algorithm for finding matchings that minimize 
the penalty. The algorithm is based on a formal problem, first defined in this paper, called Coverage Sensitive many- 
to-many min-cost bipartite Matching (CSM). This new problem variant generalizes the standard (one-to-one) 
weighted bipartite matching problem, and can be solved using network flows. The resulting Java-based tool, called 
SAGE (Scoring function for Assembled GEnomes), is freely available upon request. We demonstrate over simulated 
data that SAGE can be used to infer correct haplotypes of the highly repetitive KIR region on the Human 
chromosome 19. 



Introduction 

The inexorable drop in costs and rise in throughput of 
DNA sequencing is driving a future in which every indivi- 
dual person will have their genome sequenced, perhaps 
multiple times in their lifetimes [1]. Current high through- 
put technologies produce sequenced read fragments from 
donor genomes, which are then used for inferring the 
complete genomic sequence. The main algorithmic 
approaches for inferring a donor genome from a set of its 
sequenced reads are either based on de novo assembly 
[2,3], i.e. producing a parsimonious super-string that 
approximately contains most reads as its substrings, or 
based on mapping approaches [4-6], in which the algo- 
rithm takes the read set and a previously sequenced 
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reference genome (or a set of reference genomes), maps 
the reads to the reference, and uses the identified similari- 
ties and variations in order to predict the donor genome. 

While the accuracies of sequencing technologies keep 
improving and their usage costs keep decreasing, many 
of them still produce reads of relatively short lengths. 
Reconstruction of repetitive genomic regions using the 
mentioned approaches is considered more challenging, 
due to the fact that short reads may be de-novo 
assembled, or mapped to the reference, in multiple 
ambiguous manners. The difficulty even increases for 
diploid genomes, limiting the investigation of many 
important genomic regions, such as the killer cell immu- 
noglobulin like receptor (KIR) region (located in humans 
within the 1Mb Leucocyte Receptor Complex 19ql3.4, 
see Figure lb), the 3.6Mbp Human Leucocyte Antigen 
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(HLA) region and others, which exhibit highly repetitive 
sequences and extensive polymorphisms. 

Here, we address the problem of assessing the quality 
of a donor genome prediction given the set of its 
sequenced reads, confronting difficulties related to geno- 
mic regions of repetitive nature. We present a prediction 
quality measure a prediction quality measure which is 
independent of the approach used for generating the pre- 
diction. It combines scoring penalties related to both (a) 
imperfect alignments of the reads to the predicted region, 
and (b) deviations between the expected and actual read 
coverage of segments of the region. Our tool differs from 
previous ones which compare predictions to a known 
reference. For example, tools that evaluate the quality of 
de-novo assemblies [7] rely on comparing assembled 
genomes to known references. Mapping tools [8,9] can 
be used to provide a naive scoring function comparable 
to SAGE by summing up the best alignment score of 
each read. This naive scoring function only optimizes the 
alignment of the reads and does not take into account 
read coverage. Our results show the advantage of simul- 
taneously optimizing the combined alignment and cover- 
age score by comparing our tool to the naive approach. 

In order to evaluate the new cost function, we applied it 
to the KIR, a hyper-variable region known to be important 
for the immediate immune response in humans and higher 
mammals [10]. The KIR region is challenging to reconstruct 
from sequence read fragments due to its variable gene 
architecture (Figure la) and repetitive nature (Figure lb). 
We show that our scoring function allows us to correctly 
identify KIR haplotype templates in diploid genomes, differ- 
entiating correct predictions form incorrect ones based on 
their computed score, while the naive approach fails in 
many cases to predict the correct template. 

Our cost function for evaluating donor genome predic- 
tions is based on a new variant of a bipartite matching 



problem, entitled Coverage Sensitive many-to-many min- 
cost bipartite Matching (CSM), which is a many-to-many 
generalization of the classical min-cost (or max-weight) 
bipartite matching problem [11,12]. The formal definition 
of the CSM problem is given in the next section. While 
in general CSM is NP-Hard (see Additional File 1), we 
show a special "convexed" case for which CSM can be 
efficiently solved by reducing it to a network flow pro- 
blem, similar to many other variants of bipartite match- 
ing problems [12]. Optimal matching/flow algorithms 
were recently used by several related works to predict 
structural variations between genomes. Examples to such 
works include [13], in which min-cost flow was used to 
call copy number variations between a reference and a 
donor genome, [14], which used maximum-weight 
matching in order to reconstruct breakpoint sequences 
in long genomic insertions, and [15], which used maxi- 
mum-flow in order to apply a post-process refinement of 
simultaneous detection of structural variations in multi- 
ple genomes. 

Coverage Sensitive many-to-many min-cost 
bipartite Matching (CSM) 

The CSM problem is a many-to-many generalization of 
the classical min-cost bipartite matching problem [12]. 
We describe the problem in an abstract setting, and cast 
it to a read alignment problem in the next section. 

Consider arbitrary sets X and Y. A many-to-many 
matching (henceforth a matching) between X and Y is a 
set M of pairs {{x, y) e X x Y] (see Figure 2, (a), (b), (c). 
The coverage of an element x e X with respect to a 
matching M is c M (x) = | {y : (x, y) e M] \ . Symmetrically, 
Cm iy) = \{x ■ (x, y) e M}\ for y e Y . 

A coverage sensitive matching cost function (henceforth 
a cost function) w for X and Y assigns matching costs 
w m (x, y) for every pair (x, y) e X x Y , and coverage 
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(a) (b) (c) (d) 

Figure 2 Matching instance and its reduction to a cost flow network, (a) A bipartite graph corresponding to sets X and Y. In our particular 

application, X represents a set of reads and Y represents a set of genomic segments, where the expected coverage of each read is one and 

segments are expected to be uniformly covered. Each read x e X potentially maps to multiple segments, illustrated by the edges in the graph. 

An edge (x, y) has the weight w m (x, y), reflecting the best similarity between read x and a substring of of the genome starting at segment y. (b) 

and (c) depict two possible matchings. In (b), one of the y segments is covered by four reads, while the other two segments are covered by one 

read each. In (c), each segment is covered by two reads. It is possible that the matching in (b) is better in terms of sequence similarity, though is 

unrealistic in terms of segment coverage, which would make the matching in (c) preferable, (d) The corresponding network. Each pair of 

consecutive layers is a bipartite graph with capacities c and costs W as described. 
* ) 



costs w c (z, i) for every z e X U Y and every integer i > 
0. The cost of a matching M between X and Y with 
respect to w is given by 

w{M)= u>m{x,y)+ J2 w c {z,c M {z)) m 

(i,y)eM zeXUY V ' 

The CSM problem 

Input: A Matching Instance (X, Y, w) consisting of sets 
X, Y, and cost function w. 
Output: Compute CSM{X, Y, w) = mm^w{M) 

Note that CSM is a generalization of classical pro- 
blems in combinatorics. For example, consider the pro- 
blem of finding a maximum (partial one-to-one) 
matching on a bipartite graph G with vertex shores X, 
Y, and an edge set £. This problem can be solved by sol- 
ving CSM on the input X, Y using the following costs: 
set w c (z, 0) = w e (z, 1) = 0, and w c (z, i) = °° for all z e 
X U Y, i >1; set w m (x, y) = -1 for (x, y) e E and other- 
wise set w m (x, y) = °°. Similarly, CSM can also be used 
for solving the minimum/maximum weight variants of 
the bipartite matching problem. However, CSM is NP- 
hard in general (see Additional File 1), and therefore we 
do not expect to solve the general instance efficiently. 

CSM with convex coverage costs 

Let (X, Y, w) be a matching instance. We say that w has 
convex coverage costs if for every element z e X U Y 

and every integer i >0, w c {z,i) < Wc{z ' . We 

show here that CSM with convex coverage costs can be 



reduced to the poly-time solvable min-cost integer flow 
problem [11]. 

For x e X, denote d x = \{y : w m (x, y) <°°}|, and simi- 
larly d y = \{x : w m {x, y) <°°}| for y e Y . Denote 

d x = max 4 and d Y = max^ The re d uc ti 0 n builds the 

xsx ye y 

flow network N = (G, s, t, c, w), where G is the network 
graph, s and t are the source and sink nodes respec- 
tively, and c and w' are the edge capacity and cost func- 
tions respectively. The graph G = (V, E) is defined as 
follows (Figure 2d). 

. V = X U Y U C x U C Y U {s, i], where the sets 

C Y ={c\,c\ c\}, C Y = {c Y v c Y 2 c\}, and {s, t} 

contain unique nodes different from all nodes in X 
and Y . Note that we use the same notations for ele- 
ments in X and Y and their corresponding nodes in 
V, where ambiguity can be resolved by the context. 
.£ = £ 1 U£ 2 U£ 3 U£ 4 U£ 5 , where 

-Ei={(s, cf):cf€C x }, 

- E 2 = {(cf, x) : cf &C X , x€X, d x < i], 

- £3 = {(x, y) : x € X, y e Y, w m (x, y) < 00}, 
-E 4 = {{y, c Y ):y€Y, c Y e C Y , d Y <i], 

and 

-E s = {{c Y , t):c Y €C Y }. 

The capacity function c assigns infinity capacities to 
all edges in E x and £ 5 and unit capacities to all edges 
in E 2 , £3 and £4. The cost function w' assigns zero 
costs to edges in £1 and £5, costs w c (x, i) - w c (x, i - 1) 
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to edges (cf, x) e E 2 , costs w c (y, i) - w c (y, i - 1) to 
edges (y, cf) e £4, and costs w m (x, y) to edges (x, y) e 

£ 3 . For £' S E, denote u ''( £ ') = EX^. An iMfcger /Z. 



eeE> 



in AT is a function /: E — > {0, 1, 2, . . .}, satisfying that /(e) < 
c(e) for every e e £ (capacity constraints), and 
£ /(u, w) = E /(», «) for every v e y \ {5, f} Wow cow- 

serration constraints). The cost of a flow /in A/" is defined 

In what follows, let (X, Y, w) be a matching instance 
where w has convex coverage costs, and let N be its 
corresponding network. Due to the convexity require- 
ment, for every x e X and every integer i >0, 

u/(cf +1 , 1) — w'(cf, x) = (w c (x, i + 1) — u> c (x, i)) — (u> c (x, i) — w c (x, i — 1)) 
= ut c (x, i + 1) + if c (x, i — 1) — 2w c [x, i) > 0. 

Similarly, for every yeY and every integer z >0, 
u/(y, cj +l ) — u/{y, cj) > 0, and we get the following 
observation: 

Observation 1. Series of the form w/(cf, x), u/(cf, x), ... 
and u/(y, cf),u/(y, cf), ... are non-decreasing. Conse- 
quentially, for every e Q{(cf, x) ■. xex, i<i<d x ] and 
w'{E") <w'{E') u/(£") < u/(£'), and similarly for 

E'£{(y,cJ) : y€Y, l<i<d r ) and 

E" = {{y, cf) : yeY, 1 <i< |£'|}. 

Given a flow/in N, define the matching Mf = {(x, y) : (x, 
y) e E 3 ,f(x, y) = 1}. Denote E f x = {(cf,x) :/(cf,x) = 1} 

and E f y = {(y, cj ) : f(y, cj ) = 1} • Since for edges e e £ x U 
£ 5 we have that w'(e) = 0, and since for edges e e £ 2 U £ 3 
U £ 4 we have that /(e) e {0, 1} (due to capacity con- 
straints), we can write 



u/(f) = £/(*M*) = E ^ 



ee£ 2 UE 3 UE4 
/(e)=l 



(2) 



Given a non-infinity cost matching M between X and 
Y, define the flow/vf in N as follows: 

♦ For every (x, y) e £ 3 , / (x, y) = 1 if (x, y) e M, and 
otherwise _/(x, j) = 0; 

♦ For every (cf , x) e £2, /(cf , x) = 1 if c M (x) < i, 
and otherwise f(cf, x) = 0; 

♦ For every (y, cf ) e £4, /(y, cf ) = 1 if c M (y) < i, 
and otherwise f(y, cf) = 0; 

. For every (5, cf) e E u f(s, cf) = \{x :/(cf, x) = lj|; 
. For every (cf, t) e £5, /(cf, t) = |{y :/(y, cf) = 1)| . 



It is simple to assert that/vf is a valid flow in N (satis- 
fying all capacity and flow conservation constraints), and 
that Mf u = M . 

Claim 1. For every flow fin N, u/(f Mf ) <w'[f). 

Proof. From flow conservation constraints 

\E f x \ = \e"'\ = c Mf {x) for ever Y x e X, where in particular 

by definition we have that = {{cf, x) : 1 < i < c Mf {x)} 
Therefore, it follows from Observation 1 that 
w 1 ^" 1 ) < w'(£{) f° r ever Y x & X, and similarly it may be 
shown that u/^') <u/{E f x ) for every ye Y. Hence, 

xeX 

yeY 

< u/(M f ) + J2u/{E f x ) 



xeX 

yeY 



E ^ 2 W'(f). 



Denote A = A(X, Y, u>) = E 0) and note 

zeXUY 

that A depends only on the instance {X, Y, w) and not 
on any specific matching. 

Claim 2. For every matching M between x and Y, 
w'(f M ) = w(M) - A. 

Proof. For x e X, we have that 

and similarly u/(4 M ) = iy c (y, c M (y)) - w c (y, 0) for ye Y. 
Therefore, 

+ E u/ (4") 

= u/(M) 

+ E' 1 " C '>'' °)) 

rev 

= J] u, "( x ' Y) 

+ E u> c (z, c M (z)) 

- E ^( z ' °) 

E 4' - A. 



Claim 3. Let f be a minimum cost flow in N. Then, 
Mj* is a minimum cost matching between X and Y, and 
CSM{X, Y, w) = w'ifl) + A. 
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Proof. Since f is a minimum cost flow in N, 
u/(f ) < u/(f Mf ,) a < Vf/*), thus w'{f*) = u/[f Mf ,). Let M be a 
matching between X and Y, Again, from the optimality of 
f*> w'(f) < w'(f M ) and so u,(M r )-A a :V(/^)-»'cn<i</(A) a = 2 »(M)-A, 
and in particular w(Mf*) < w{M). Thus, Mf* is a mini- 
mum cost matching for (X, Y, w), and so 

CSM{X, Y, w) = w(M f .) = w'{f*) + A ■ 
□ 

Constructing CSM instance from read mapping data 

Consider a set of reads and a prediction of the genomic 
sequence (henceforth, the "prediction") from which the 
reads were extracted. It is assumed that the sequencing 
procedure produces reads with some sequencing error 
probability, and that read extraction positions along the 
genome adhere to some expected distribution. The prob- 
ability for extracting a read starting at a given position 
may depend on the sequential context at this position 
and its location along the genome. Given such probabil- 
ities, it is possible to compute for a given segment of the 
prediction an expected amount of extracted reads start- 
ing within this segment. Such an amount of expected 
reads will be referred to here as the expected coverage of 
the segment. Hence, we can argue that the reads well 
support the prediction in case it is possible to assign to 
each read a position within the prediction, from which it 
was presumably extracted, in a manner that (a) each read 
sequence approximately matches the substring of the 
prediction starting at the assigned position, and (b) for 
every segment of the prediction, the amount of reads 
assigned to positions within this segment does not devi- 
ate significantly from the expected coverage of the seg- 
ment. On the other hand, when no such position 
assignment can be found, it is suggestive that the predic- 
tion exhibits some variation with respect to the true 
genome. 

Given a predicted region, a mapping between the reads 
and the prediction is a function that assigns to each read a 
set of positions in the region from which it is possible to 
extract the read (with some allowed amount of sequencing 
errors). Software tools for producing such mappings exist 
(e.g. Bowtie [8]) and are widely used. Ideally, if the predic- 
tion is in fact the correct genomic sequence from which 
the reads were extracted, and this region is non-repetitive, 
it is expected that a mapping would assign to each read a 
unique position that is the true position from which it was 
extracted. Nevertheless, when the sequence contains 
repeats, and sequencing errors are not negligible, it is 
expected that some of the reads will be mapped to multi- 
ple positions (due to the repeats), while others may not be 
mapped to any position (due to sequencing errors). Given 
a mapping between the reads and the region, we define a 



read-to-genome matching as a function that selects for 
each read at most one corresponding position among its 
set of positions given by the mapping, from which it was 
presumably extracted. A read-to-genome matching better 
supports the prediction the more reads it matches to the 
genome, the higher the similarity is between reads and 
their matching positions, and the smaller the deviation is 
between the expected coverage and the coverage implied 
by the matching positions. 

The quality of a read-to-genome matching can be natu- 
rally evaluated using the CSM formalism described in the 
previous section. A matching instance (X, Y, w) can be 
generated, choosing X to be the set of reads, and Y to be a 
partition of the prediction into segments (where each ele- 
ment in Y corresponds to a segment in the partition). For 
each read x e X and each segment ye Y,w m (x, y) is set 
to the best sequence similarity score between x and a sub- 
string of the prediction starting at y (such similarity scores 
may be generated by tools such as Bowtie [8]), or set to °° 
if no substring starting at y is similar to x. The coverage 
cost function for a read xe X sets w c (x, 0) to some pen- 
alty added to the score in case x is unmatched, sets w c 
(x, 1) to 0 (no penalty is added when x participates in the 
matching), and w c (x, i) for i >1 to °° (a matching in which 
a read is assigned to more than one position is illegal, and 
has an infinite cost). For a segment ye Y, it is possible to 
compute the expected coverage c y of y, and generate a 
convex score function fli) whose minimum point is at i = 
c y , and set w c (y, i) = f (i) for every nonnegative integer i. 
The cost of an optimal matching for this instance can 
then serve as a quality measure for the prediction. 

Implementation 

We implemented the CSM algorithm as a java based 
tool named SAGE, a Scoring function for Assembled 
GEnomes. The inputs to SAGE are a set of reads, R, 
mapped to a genomic template, G, in the BAM format 
[16] along with a parameter file containing alignment 
costs, unmatched read penalty, genome segmentation, 
expected segment coverage values, and a choice of cov- 
erage cost functions (currently linear and polynomial 
cost functions). 

Results 

We tested SAGE on the hypervariable KIR region. The KIR 
region, while variable, is tightly organized and contains 
between 8 and 14 genes, and 2 pseudo-genes (Figure la) 
[17]. The genes are organized into two adjacent regions, 
each bordered by two anchoring genes/pseudo-genes: 
KIR3DL3 and 3DP1 for the centromeric region; 2DL4 and 
3DL2 for the telomeric region. Variability within KIR is 
expressed in the form of changing gene numbers, gene- 
copy numbers, and gene polymorphisms. There are two 
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broad types of KIR haplotypes- Type A and Type B- that 
are distinguished by their gene content. Type A haplotypes 
are characterized by the absence of the following genes: 
{KIR-2DL5, -2DS1, -2DS2, -2DS3, -2DS5, -3DS1}, while 
Type B haplotypes contain one or more of these genes [18]. 
Type B haplotypes can be split further into different sub- 
types, characterized by the gene content on the centro- 
meric-side and telomeric-side. The various (sub-)types of 
KIR haplotypes are denoted by {A, AB, BA1, BA2, BA2X, 
Bdel, B}. However, the typing is incompletely developed, 
and is likely to change as more data is acquired. 

To test the effectiveness of SAGE on a variety of hap- 
lotype types, we simulated reads from 27 known KIR 
haplotypes using GemSIM [19] with an error model 
learned from paired-end (100 x 2)bp reads generated by 
Illumina GA IIx with TrueSeq SBS Kit v5-GA [19]. The 
27 haplotype templates were taken from the IPD-KIR 
database [20]. The sequences of these templates were 
obtained experimentally by first separating the two hap- 
lotypes of an individual using fosmid-pools, determining 
the gene content and architecture of each haplotype 
using STS assays, and then finally sequencing the indivi- 
dual fosmids [21]. 

Before we ran SAGE, we mapped each read set, R, 
back to each template, G, using Bowtie. We ran Bowtie 
under the '-a' option with all other parameters set to the 
default, in order to obtain a set of all possible mapping 
locations and their corresponding alignment costs for 
each read, which was used as input into SAGE. The 
mapping position of a paired-end read was set to be the 
genomic index to which the first character of the first 
sub-read was aligned. The alignment cost for a complete 
(100 x 2)bp paired-end read varied between 0 and 180, 
with 0 corresponding to identity. When two paired-ends 
mapped in a concordant manner, the total alignment 
cost for the read was calculated by adding the alignment 
cost of both paired-ends. When a paired-end did not 
have a concordant mate, suggestive of incorrect archi- 
tecture, the alignment cost was further penalized by 
adding a cost of 90, which is the maximum penalty for 
one paired-end. The unmatched read penalty was con- 
stant for all reads and set to 100. On the other side, the 
genome G was partitioned to segments of fixed length 
of lOOObp (except for the last segment which may be 
shorter), with expected coverage per segment given by 

IRI 

X = lOOOjgj- (with the appropriate adjustment for the 

last segment), where \R\ and |G| denote the number of 
reads and the length of the genome, respectively. To 
allow for natural variation in coverage, the quadratic 
function, j{i) = (X - i) 2 , was chosen as the segment cov- 
erage cost function. 

To the best of our knowledge, SAGE is the first tool 
that scores templates given a set of reads. As there is no 



competing tool, we compared SAGE results against a 
naive approach that ignores coverage and sums up the 
best alignment score for each read to obtain a total 
score for each read set and template. The scores 
obtained by this approach will be referred to as the 
Bowtie scores below. 

Haploid templates 

As a first pass, we tested SAGE's ability to score hap- 
loid templates. We scored each of the 27 read sets 
against each of the 27 templates using SAGE. A visua- 
lization of the scores are shown in Figure 3, where the 
templates are organized by sequence similarity so that 
templates of the same type/sub-type are clustered 
together. Note that the matrix is not symmetric. Each 
row corresponds to the scores of a single read data set 
against a collection of haploid templates. As can be 
seen, SAGE always gets the top-score for the correct 
template. Moreover, the other templates from the 
same sub-type get progressively weaker scores. Major 
haplotypes fall within distinct blocks, but the scores 
also suggest a hierarchy within the subtypes that can 
be studied further. 

Dipolid templates 

To test scoring on more realistic templates, we simulated 
reads from 9 diploid individuals whose pair of haploid 
templates were obtained experimetally in Pyo et al. [21] 
and are in the IPD-KIR database [20]. The 9 diploid tem- 
plates from this study fell into one of 6 combination of 
sub-types. We scored each of the 9 simulated read sets 
against each of the 9 diploid templates using SAGE. In all 
but one case, SAGE (Figure 4a) and Bowtie (Figure 4b) 
predicted the correct diploid template of the donor. 
Furthermore, SAGE is better at predicting the sub-type 
of the donor template than Bowtie. When the donor tem- 
plate is not in database, as is usually the case in practice, 
SAGE will give a better score to templates that are more 
similar to the donor while Bowtie may not. For example, 
row 3 of Figure 4(a, b) show the scores when the donor 
template is of type A and BA1. Both SAGE and Bowtie 
correctly gave the best score to the diploid template 
G085-A/BA1. However, the template with the next best 
SAGE score was also of sub-type A/BA1, while the tem- 
plate with the next best Bowtie score was of subtype A/ 
BA2. 

In general, coverage plays an important role in deter- 
mining the correct haplotype. Figure 5(b-e) show the 
coverage plots when reads from donor template G085- 
A/BA1 are mapped to a template of the same sub-type 
(F06-A/BA1) and a template of a different sub-type 
(FH13-A/BA2) using SAGE and Bowtie. When mapped 
to templates of the same sub-type (Figure 5(b, d)), the 
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Figure 3 Scoring simulated reads against haploid templates using SAGE Each row contains the color-coded percentage from the top- 
score of a read-set mapped against 27 genomic templates. Black: top-score; Red: within 5% of top-score; Orange: < 10%; Yellow: < 20%; White: 
>20% below top-score. Sequences are ordered along the rows and columns so that sequences with the same (sub-)type are adjacent to each 
other. Templates of the same type are indicated by the blue boxes, and those of the same sub-type by light blue boxes. 



coverage plots for both SAGE and Bowtie show less 
variance when compared to the coverage plots of the 
other templates (Figure 5(c, e)). Bowtie does not take 
into account variance of coverage and scores the tem- 
plate of a different sub-type (FH13-A/BA2) higher than 
the template of the same sub-type (F06-A/BA1). On 
the contrary, SAGE penalizes for the variance in cover- 
age, and correctly predicts the sub-type of the donor. 



Furthermore, if several possible mappings of a read are 
given, SAGE can be used to determine the best map- 
ping. In Figure 5(b, c), we see less variability in the 
coverage plots from SAGE's matching compared 
against those of Bowtie's matching (Figure 5(d, e)). 
Therefore, even if Bowtie is able to determine the cor- 
rect donor template, it may not output the correct 
mapping. 
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Figure 4 Scoring simulated reads against diploid templates. Each row of a matrix represent scores from the same read sets mapped to 
different prediction templates. The scores are normalized so that the second best score in each row is equal to 1 and the worst score is equal to 
0. We normalize with respect to the second best score since it would be used to predict the haplotype in the absence of the best scoring (and 
presumably correct) template. Furthermore, the entries are color-coded accordingly- Black: top-score; Red: second top-score; Orange: within 10% 
of second top-score; Light Orange: < 20%; Yellow: < 30%; White: >30% below top-score. Both matrices are ordered according to template sub- 
types. Templates of the same type are indicated by the blue boxes, (a) SAGE scores (b) Bowtie scores. 



Running time 

For a data-set with n reads and a total of m read map- 
ping locations, SAGE scales as 0(nm + n 2 log n). Thus, 
on our data-sets with haploid genomes of average length 
166Kbp (166 lOOObp-segments), and ~ 24,900 reads, 
SAGE ran in 21 seconds. The running time increased to 
210 seconds for the average diploid genome (~ 332 
lOOObp-segments, ~ 49,800 reads). Running times were 
recorded using a 4 core Intel 2.66GHz processor with 
9Gb of RAM. 

Discussion and conclusions 

To the best of our knowledge, SAGE is the first tool 
that scores predicted donor templates given a set of 
sequenced reads. Our results on the KIR region show 
that SAGE can be used to predict the sub-type of the 
donor KIR template, and can be directly used for haplo- 
typing this region. Furthermore, SAGE scores the cor- 
rect template higher than even templates of the same 
sub-type. 

While we focused our attention on the KIR region, 
SAGE is general enough to be applied to any complex 
region. It is also possible to implement many different 
scoring functions, which would allow the user to obtain 
optimal matchings according to his own custom scores. 
For example, read unmatching penalties may be constant 
for all reads, or may be read-specific. A motivation for 
read specific costs is in the case where the sequencing 
phase produces some sequencing qualities for reads, and 
it is possible to "pay" less when not matching reads of 
lower sequencing quality. Similarly, it is possible to 
choose a segmentation of the prediction in which all seg- 
ments are of the same length, and uniform coverage is 



assumed, or one with variable segment lengths and possi- 
bly different coverage cost functions for each segment. 
A motivation to such complex segmentation is e.g. in the 
case where one tries to identify a specific structural varia- 
tion, such as a deletion of a segment of specific length 
around a specific region of the prediction. Setting lengths 
of segments in the examined region to the expected dele- 
tion length can increase the likelihood that an optimal 
matching would not add artifact matchings of reads to a 
long segment spanning the deleted segment, in order to 
compensate for low coverage of the deleted segment. 
Lastly, by using different coverage cost functions, it is 
possible to decide the rate in which penalty increases due 
to deviations of expected coverage, which may grow line- 
arly, polynomially, exponentially, or based on other prob- 
abilistic models, as long as the function satisfies the 
convexity requirement. 

Future work would involve extending the use of SAGE 
on real data. Some challenges in dealing with real data 
include obtaining the set of reads extracted from the 
region of interest (especially when sequencing data is 
likely taken from the whole genome) and providing the 
expected coverage. If we know the parameters of the 
sequencing run, we could use the target read coverage as 
the expected coverage; however, if that is unknown, we 
may be able to estimate the expected coverage from the 
number of reads we need to map to the region. For 
example, if we assume a uniform distribution of coverage, 
then the expected coverage per segment is simply the 
total length of the segment multiplied by jgj-. 

Although haplotype analysis of the KIR region is 
medically relevant, the genomic complexity (i.e. repeti- 
tive nature and variable gene architecture) of this region 
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Figure 5 Coverage plots for reads sampled from G085-A/BA1 templates, (a) genomic architecture of of G085-A/BA1, FH06-A/BA1, and 
FH13-A/BA2. SAGE coverage plots when reads are extracted from G085A/BA1 and mapped to (b) FH06-A/BA1 and (c) FH13 A/BA2. Bowtie 
coverage plots when reads are extracted from G085A/BA1 and mapped to (d) FH06-A/BA1 and (e) FH13-A/BA2. 



makes it difficult to do a complete analysis. Indeed, the and other complex regions of the genome remain a 
possible sub-types of this region have not been comple- worthwhile problem. SAGE takes the first step in recon- 
tely characterized. Thus, reconstruction of this region structing complex regions of the genome by providing a 
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scoring function for predicted templates based on their 
similarity to the true donor. Therefore, it might be pos- 
sible to obtain a complete reconstruction of the donor 
genome by iteratively refining predicted donor templates 
until SAGE scores are optimized. Furthermore, SAGE 
can also be applied for scoring de-novo assemblies and 
for comparing the accuracies of different assemblers. 

Additional material 



Additional file 1: NP-hardness of CSM 
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